Search Results: "huber"

31 May 2012

Russell Coker: Links May 2012

Vijay Kumar gave an interesting TED talk about autonomous UAVs [1]. His research is based on helicopters with 4 sets of blades, and his group has developed software that allows them to build maps, fly in formation, and more.
Hadiyah wrote an interesting post about networking at TED 2012 [2]. It seems that giving every delegate the opportunity to have their bio posted is a good conference feature that others could copy.
Bruce Schneier wrote a good summary of the harm that post-9/11 airport security has caused [3].
Chris Neugebauer wrote an insightful post about the drinking culture at conferences, how it excludes people and distracts everyone from the educational purpose of the conference [4].
Matthew Wright wrote an informative article for Beyond Zero Emissions comparing current options for renewable power with the unproven plans for new nuclear and fossil fuel power plants [5].
The Free Universal Construction Kit is a set of design files to allow 3D printing of connectors between different types of construction kits (Lego, Fischertechnik, etc.) [6].
Jay Bradner gave an interesting TED talk about the use of Open Source principles in cancer research [7]. He described his research into drugs which block cancer by converting certain types of cancer cell into normal cells, and how he shared that research to allow the drugs to be developed for clinical use as fast as possible.
Christopher Priest wrote an epic blog post roasting everyone currently associated with the Arthur C. Clarke awards; he took particular care to flame Charles Stross, who celebrated The Prestige of such a great flaming by releasing a t-shirt [8]. For a while I've been hoping that an author like Charles Stross would manage to make more money from t-shirt sales than from book sales. Charles is already publishing some of his work for free on the Internet, and it would be good if he could publish it all for free.
Erich Schubert wrote an interesting post about the utility and evolution of Facebook likes [9].
Richard Hartmann wrote an interesting summary of the problems with Google products that annoy him the most [10].
Sam Varghese wrote an insightful article about the political situation in China [11]. The part about the downside of allowing poorly educated people to vote seems to apply to the US as well.
Sociological Images has an article about the increased rate of Autism diagnosis as social contagion [12]. People who get their children diagnosed encourage others with similar children to do the same.
Vivek wrote a great little post about setting up WPA on Debian [13]. It was much easier than expected once I followed that post. Of course I probably could have read the documentation for ifupdown, but who reads docs when Google is available? Related posts:
  1. Links March 2012 Washington s Blog has an informative summary of recent articles about...
  2. Links April 2012 Karen Tse gave an interesting TED talk about how to...
  3. Links February 2012 Sociological Images has an interesting article about the attempts to...

11 April 2012

Erich Schubert: Are likes still worth anything?

When Facebook became "the next big thing", "like" buttons popped up on various web sites. And of course "going viral" was the big thing everybody talked about, in particular among SEO experts (and those who would like to be).
But things have changed. In particular, Facebook has. In the beginning, any "like" would be announced in the newsfeed to all your friends. This was what allowed likes to go viral, when your friends re-liked the link. This is what made it attractive to have like buttons on your web pages. (Note that I'm not referring to "likes" of a single Facebook post; they are something quite different!)
Once everybody "knew" how important this was, everybody tried to make the most of it. In particular scammers, virus authors and SEO people. Every other day, some clickjacking application would flood Facebook with likes. Every backwater website was trying to get a bigger audience by getting "liked". But at some point Facebook just stopped showing "likes". This is not bad. It was the obvious reaction when people got too annoyed by the constant "like spam". Facebook had to put an end to this.

But by now, a "like" is pretty much worthless (in my opinion). Still, many people following "SEO tutorials" are all crazy about likes. Instead, we should reconsider whether we really want to slow down our site loading by having like buttons on every page. A like button is not as lightweight as you might think: it is a complex piece of JavaScript that tries to detect clickjacking attacks, and it invades your users' privacy, to the point where, for example in Germany, it may even be illegal to use the Facebook like button on a web site.
In a few months, the SEO people will realize that "likes" are a fad now, and will likely all try to jump on the Google+ bandwagon. Google+ is probably not half as much of a "dud" as many think it is (their friends are still on Facebook, and you cannot scribble birthday wishes on a wall in Google+). The point is that Google can actually use the "+1" votes to improve everyday search results. Google for something a friend liked, and it will show up higher in the search results, and Google will show the friend who recommended it. Facebook cannot do this, because it is not a search engine (well, you can use it for searching for people, although Ark is probably better at this, and nobody searches for people anywhere near as often as they run regular web searches). Unless they enter a strong partnership with Microsoft Bing or Yahoo, Facebook "likes" can never be as important as Google "+1" votes. So don't underestimate the Google+ strategy in the long run.
There are more areas where Facebook is by now much less useful than it used to be. For example, event invitations. When Facebook was in full growth, you could essentially invite all your friends to your events. You could also use lists to organize your friends and invite only the appropriate subset, if you cared enough. The problem again was: nobody cared enough. Everybody would just invite all their friends, and you would end up getting "invitation spam" several times a day. So again Facebook had to change and limit the invitation capabilities. You can no longer invite everyone, or even just everyone on one particular list. There are some tools and tricks that can work around this to some extent, but once everybody uses those, Facebook will just have to cut it down even further.
Similarly, you might remember "superpoke" and all the "gift" applications. Facebook (and the app makers) probably made a fortune on them with premium pokes and gifts. But this too reached a level that started to annoy the users, so Facebook had to cut down the ability of applications to post to walls. And boom, this segment essentially imploded. I haven't seen numbers on Facebook gaming, and I figure that by doing some special setup for the games Facebook managed to keep them somewhat happy. But many will remember the time when the newsfeed would be full of Farmville and Mafia Wars crap ... it just doesn't work that way any longer.

So when working with Facebook and the like, you really need to be on the move. Right now it seems that groups and applications are more useful for getting that viral dream going. A couple of sites, such as Yahoo, currently require you to install their app (which may then post to your wall on your behalf and get your personal information!) before you can follow a link shared this way, and can then actively encourage you to reshare. And messages sent to a Facebook group are more likely to reach people who aren't direct friends of yours. When friends actually "join" an event, this currently shows up in the news feed. But all of this can change with zero days' notice.
It will be interesting to see whether Facebook can keep up in the long run with Google's ability to integrate the +1 votes into search results. It probably takes just a few success stories in the SEO community for +1s, instead of Facebook likes, to become the "next big thing" in SEO. Then Google just has to wait for them to virally spread +1 adoption. Google can wait - its Google+ growth rates aren't bad, and it already has a working business model that doesn't rely on the extra growth - it is big already and makes good profits.
Facebook, however, is walking a surprisingly thin line. It needs tight control over the amount of data shared (which is probably why it tries to do this with "magic"). People don't want to have the impression that Facebook is hiding something from them (although it is in fact suppressing a huge part of your friends' activity!), but they also don't want all this data spammed at them. And in particular, Facebook needs to give web publishers and app developers the right amount of extra access to the users, while keeping the worst of the spam away from the users.

Independent of the technology and the actual products, it will be really interesting to see whether we manage to get the balance in "social" one-to-many communication right. It's not Facebook's fault that many people "spam" all their friends with all their "data". Google's Circles probably aren't the final answer either. The reason email still works rather well is probably that it makes one-to-one communication easier than one-to-many, that it isn't realtime, and that people expect you to put enough effort into composing your mails and choosing the right recipients for the message. Current "social" communication is pretty much posting everything to everyone you know, addressed "to whom it may concern". Much of it is in fact rather non-personal or even non-social. We have definitely reached the point where more data is shared than is being read. Twitter is probably the most extreme example of a "write-only" medium. The average number of times a tweet is read by a human other than the original poster must be way below 1, and certainly much less than the average number of "followers".
So in the end, the answer may actually be a good automatic address book, with automatic groups and rich clients, to enable everybody to use email more efficiently. On the other hand, separating "serious" communication from "entertainment" communication may well be worth a separate communications channel, and email definitely is dated and has spam problems.

15 March 2012

Erich Schubert: ELKI applying for GSoC 2012

I've submitted an organization application to the Google Summer of Code 2012 for ELKI - an open source data mining framework I'm working on.
I hope I can get spots for 1-2 students to help implement additional state-of-the-art methods, to allow for even broader comparisons. Acceptance notifications will go out tomorrow. I have no idea how high our chances of getting accepted are. We're open source, and as part of the university we have a proven record of educating students (in fact, around two dozen have contributed to ELKI by now, although not all of their work has been "released" yet). We're rather small compared to e.g. Gnome, Debian or Apache, and just a few years old. But I believe we are trying to fill an important gap in the interaction between research and open source: way too many algorithms get published in science but are never made available as source code to test and compare. This is where we try to step in: making many largely overlooked methods available. Of course we also have k-means (who hasn't?), but there is much more than just k-means! And there are plenty of methods often cited in the scientific literature for which nobody seems to have a working implementation ...

While most people equate data mining with prediction and classification (both of which are actually more machine learning topics), ELKI is strong on cluster analysis and outlier detection, as well as index structures. Plus, it is much more flexible than other frameworks. For example, we allow almost arbitrary combinations of algorithms and distance functions. So with our tool, you can easily test your own distance function by plugging it into various algorithms. Otherwise we could just have used Weka.
The other reason why we did not extend Weka (or use R) is that we wanted not just to implement some algorithms, but to be able to study the effects of index structures on the algorithms. And in my opinion, this is actually the key difference between true data mining and ML, AI or "statistics". In ML and statistics, the key objective is result quality, which is often rather easy to measure, too. For "full" data mining, one also needs to consider all the issues of actually managing the data and indexing it to accelerate computations. Plus, the prime objective is to discover something new that you did not know before.
Of course these things cannot be completely separated: you can discover patterns and rules in the data that will allow you to make good predictions. But data mining is not just prediction and classification!
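To make the pluggability point concrete, here is a toy sketch in Python of what "plugging your own distance function into various algorithms" looks like conceptually (ELKI itself is Java; all names below are made up for illustration and are not ELKI's API):
# Toy illustration of algorithm/distance pluggability; these names are
# made up for illustration and are not ELKI's actual (Java) API.
from math import sqrt

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_query(data, query, k, distance):
    # any algorithm written against a distance callback
    # automatically works with every distance function
    return sorted(data, key=lambda p: distance(query, p))[:k]

data = [(1.0, 2.0), (3.0, 4.0), (0.5, 1.5)]
print(knn_query(data, (1.0, 1.0), k=2, distance=euclidean))
print(knn_query(data, (1.0, 1.0), k=2, distance=manhattan))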

The next release of ELKI, 0.5, will come out in early April. ELKI 0.4 is already available in Debian testing and Ubuntu. The next release focuses on comparing clustering results, including an experimental visualization of clustering differences, which will be presented at the ICDE 2012 conference.
The big to-do list contains a lot of stuff. In particular, ELKI is currently mostly useful for research, as it is too difficult to use for the average business user wanting to do data mining. You can think of it more as a prototyping system: try out what works for you, then implement that within your larger project in an integrated and optimized way. The other big thing that I'm unhappy with is the visualization speed. SVG is great for print export, but Batik can be really slow, and the XML DOM API isn't exactly "accessible" to students wanting to add new visualizations. Right now it is all useful and okay for my kind of experiments, but it could be so much more if we could solve the speed (and memory) issues of the pure SVG-based approach. I'd love to see something much faster here, but with an SVG export for editing and printing.

14 March 2012

Erich Schubert: Google Scholar, Mendeley and unreliable sources

Google Scholar and Mendeley need to do more quality control.
Take for example the article
A General Framework for Increasing the Robustness of PCA-Based Correlation Clustering Algorithms
Hans-Peter Kriegel, Peer Kröger, Erich Schubert, Arthur Zimek
Scientific and Statistical Database Management (SSDBM 2008)
(part of my diploma thesis).
Apparently, someone screwed up entering the data into Mendeley and added the editors to the authors. Now Google happily imported this data into Google Scholar, and keeps on reporting the authors incorrectly, too. Of course, many people will again import this incorrect data into their bibtex files, upload it to Mendeley and others...
Yet, neither Google Scholar nor Mendeley has an option for reporting such an error. They don't even realize that SpringerLink - where the DOI points to - may be the more reliable source.
On the contrary, Google Scholar has just started suggesting the editors to me as coauthors ...
They really need to add an option to fix such errors. There is nothing wrong with having errors in gathered data, but you need to have a way of fixing them.

25 January 2012

Russell Coker: SE Linux Status in Debian 2012-01

Since my last SE Linux in Debian status report [1] there have been some significant changes.
Policy: Last year I reported that the policy wasn't very usable; on the 18th of January I uploaded version 2:2.20110726-2 of the policy packages, which fixes many bugs. The policy should now be usable by most people for desktop operations and as a server. Part of the delay was that I wanted to include support for systemd, but as my work on systemd proceeded slowly and others didn't contribute policy I could use, I gave up and just released it. Systemd is still a priority for me and I plan to use it on all my systems when Wheezy is released.
Kernel: Some time between Debian kernel 3.0.0-2 and 3.1.0-1, support for an upstream change to the security module configuration was incorporated. Instead of using selinux=1 on the kernel command line to enable SE Linux support, the kernel option is now security=selinux. This change allows people to boot with security=tomoyo or security=apparmor if they wish. No support for Smack though. The kernel silently ignores command-line parameters that it doesn't understand, so there is no harm in having both selinux=1 and security=selinux on both older and newer kernels. So version 0.5.0 of selinux-basics now adds both kernel command-line options to the GRUB configuration when selinux-activate is run. Also, when the package is upgraded it will search for selinux=1 in the GRUB configuration, and if it's there it will add security=selinux. This will give users the functionality that they expect: systems which have SE Linux activated will keep running SE Linux after a kernel upgrade or downgrade! Without the updated selinux-basics, systems running Debian/Unstable won't work with SE Linux. As an aside, the postinst file for selinux-basics was last changed in 2006 (thanks Erich Schubert). This package is part of the new design of SE Linux in Debian, and some bits of it haven't needed to be changed for 6 years! SE Linux isn't a new thing; it's been in production for a long time.
Audit: While the audit daemon isn't strictly a part of SE Linux (each can be used without the other), it seems that most of the time they are used together (in Debian at least). I have prepared an NMU of the new upstream version of audit and uploaded it to delayed/7. I want to get everything related to SE Linux up to date, or at least to versions comparable with Fedora. I also sent some of the Debian patches for auditd upstream, which should reduce the maintenance effort in future.
Libraries: There have been some NMUs of libraries that are part of SE Linux. Due to a combination of having confidence in the people doing the NMUs and not having much spare time, I have let them go through without review. I'm sure that I will notice soon enough if they don't work; my test systems exercise enough SE Linux functionality that it would be difficult to break things without me noticing.
Play Machine: I am now preparing a new SE Linux Play Machine running Debian/Unstable. I wore my Play Machine shirt at LCA, so I've got to get one going again soon. This is a good exercise of the strict features of SE Linux policy; I've found some bugs which need to be fixed. Running Play Machines really helps improve the overall quality of SE Linux. Related posts:
  1. Status of SE Linux in Debian LCA 2009 This morning I gave a talk at the Security mini-conf...
  2. SE Linux in Debian I have now got a Debian Xen domU running the...
  3. Debian SE Linux Status At the moment I ve got more time to work on...

9 October 2011

Erich Schubert: Class management system

Dear Lazyweb.
A friend of mine is looking for a small web application to manage tiny classes (as in courses, not as in computing). They usually span just four dates, and people will often sign up for the next class afterwards. Usually 10-20 people per class, although some might not sign up via the internet.
We deliberately don't want to require them to fully register for the web site and go through all the registration and email verification trouble. Anything that takes more than filling out the obviously required form will just cause trouble.
At first this sounded like a common task, but in essence all the systems I've seen so far are totally overpowered for it. There are no grades, no working groups, no "customer relationship management". Not much more is needed than the ability to easily configure the classes, have people book them, and easily get the list of signed-up users into a spreadsheet (CSV will do).
It must be able to run on the typical PHP+MySQL web hoster and be open source.
Any recommendations? Drop me a comment or email at erich () debian org Thank you.

28 September 2011

Erich Schubert: Privacy in the public opinion

Many people in the United States seem to be of the opinion that the "public is willing to give up most of their privacy", in particular when dealing with online services such as Facebook. I believe that in his keynote at ECML-PKDD, Albert-László Barabási of Harvard University expressed such a view: that this data will just become more and more available. I'm not sure if it was him or someone else (I believe it was someone else) who essentially claimed that "privacy is irrelevant". Another popular opinion is that "it's only old people caring about privacy".
However, just like politics, these things tend to oscillate from one extreme to the other. For example, in recent years in Europe, conservative parties were winning one election after another. Now in France the socialists have just won the senate, the conservative parties in Germany are losing in one state after another, and so on. And this will change back again, too. Democracy also lives from changing roles in government, as this both drives progress and fights corruption.
We might be seeing the one extreme in the United States right now, where people readily give away their location and interests for free access to a web site. This can swing back any time.
In Germany, one of the government parties - the liberal democrats, FDP - just dropped out of the Berlin state parliament, down to 1.8% of the vote. Yes, this is the party of the German foreign minister, Guido Westerwelle. The pirate party [en.wikipedia.org] - much of their program is about privacy, civil rights, copyright reform and the internet - which didn't even participate in the previous elections since it was founded just 5 years ago, jumped to 8.9%, scoring higher than the liberal democrats did in the previous elections. In 2009 they scored a surprisingly high 2% in the federal elections - current polls see them anywhere from 4% to 7% at the federal level, so they will probably get seats in parliament in 2013. (There are also other reasons why the liberal democrats have been losing voters so badly, though! Their current numbers indicate they might drop out of parliament in 2013.)
The Greens in Germany, who are also very much oriented towards privacy and civil rights, are also on the rise, and in March became the second-strongest party and senior partner in the governing coalition of Baden-Württemberg, which historically was a heartland of the conservatives.
So don't assume that privacy is irrelevant nowadays. Public opinion can swing quickly. In particular in democratic systems that have room for more than two parties - so probably not in the United States - such topics can actually influence elections a lot. Within 30 years, the Greens have grown to frequently reach 20% in federal polls and up to 30% in some states. It doesn't look as if they are going to go away soon.
Also don't assume that it's just old people caring about privacy - in Germany, the pirate party and the Greens in particular are very much favored by young people. The typical voter for the pirates is less than 30 years old, male, has a higher education and works in the media or internet business.
In Germany, much of the protest for more privacy - and against the overly eager data collection by companies such as Facebook and Google - is driven by young internet users and workers. I believe this will be similar in other parts of Europe - there are pirate parties all over Europe. And this can happen in the United States any time, too.
Electronic freedom - pushed e.g. by the Electronic Frontier Foundation, but also by the open source movement - does have quite a history in the United States. But open source in particular has made such huge progress in the last decade that these movements in the US could just be a bit out of breath right now. I'm sure they will come back with a strong push against the privacy invasions we're seeing right now. And that can likely take down a giant like Facebook, too. So don't bet on people continuing to give up their privacy!

29 August 2011

Erich Schubert: ELKI 0.4 beta release

Two weeks ago I published the first beta of the upcoming ELKI 0.4 release. The accompanying publication at SSTD 2011 won the "Best Demonstration Paper Award"!
ELKI is a Java framework for developing data mining algorithms and index structures. It has indexes such as the R*-tree and M-tree, and a huge collection of algorithms and distance functions. These are all written rather generically, so you can build all combinations of indexes, algorithms and distances. There are also evaluation and visualization modules.
Note that I'm using "data mining" in the broad, original sense that focuses on knowledge discovery by unsupervised methods such as clustering and outlier detection. Today, many people just think of machine learning and "artificial intelligence" - or even worse, large-scale data collecting - when they hear data mining. But there is much more to it than just learning!
Java comes at a certain price. The latest version already got around 50% faster than the previous release just by reducing Java boxing and unboxing, which puts quite some pressure on the memory management. So you could implement these things in C and become a lot faster; but this is not production software. I need code that I can put students on to work with and extend; this is much more important than getting the maximum speed. You can probably still use it for prototyping: see what works for you, then implement just the part you really need in a low-level language for maximum performance.
You can do some of that in Java. You could work on a large chunk of doubles and access them via the Unsafe class. But is that then still Java, or aren't you actually just writing plain C? In our framework, we want to be able to support non-numerical vectors and non-double distances, too, even when they are only applicable to certain specialized use cases. Plus, generic and Java-style code is usually much more readable, and the performance cost is not critical for research use.
Release 0.4 has plenty of under-the-hood changes. It allows multiple indexes to exist in parallel, and it supports multi-relational data. There are also a dozen new algorithms, mostly from the geo/spatial outlier field, which were used for the demonstration. But it also includes, for example, methods for rescaling the output of outlier detection methods to a more sensible numerical scale for visualization and comparison.
You can install ELKI on a Debian testing or unstable system with the usual "aptitude install elki" command. It installs a menu entry for the UI and also includes the command-line launcher "elki-cli" for batch operation. The "-h" flag produces extensive online help, or you can just copy the parameters from the GUI. By reusing Java packages such as Batik and FOP that are already in Debian, this also makes for a smaller download. I guess the package will at some point also transition to Ubuntu - since it is Java, you can just download and install it anyway, I guess.

22 August 2011

Erich Schubert: Missing in Firefox: anon and regular mode in parallel

The killer feature that Chrome has and Firefox is missing is quite simple: the ability to have "private" aka "anonymous" and non-private tabs open at the same time. As far as I can tell, with Firefox you can only be in one of these modes at a time.
My vision of a next-generation privacy browser would essentially allow you to have tabs in individual modes, with some simple rules for mode switching. Essentially, unknown sites should always be opened in anonymous mode. Only sites that I register for should automatically switch to a tracked mode where cookies are kept, for my own convenience. And then there are sites in a safety category that should be isolated from any other content, such as my banking site. Going to these sites should require me to manually switch modes (except when using my bookmark). Embedding and framing of these sites should be impossible.
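For illustration, here is a toy sketch of those mode-switching rules in Python; the site lists and mode names are hypothetical, chosen purely to make the policy concrete:
# Toy sketch of the mode-switching rules above; the site lists and
# mode names are hypothetical, chosen only to make the policy concrete.
REGISTERED_SITES = {"example-forum.org"}  # sites I signed up for: tracked mode
ISOLATED_SITES = {"mybank.example"}       # banking etc.: isolated, manual entry only

def mode_for(host, manual_switch=False):
    if host in ISOLATED_SITES:
        # reachable only via an explicit manual switch (or a bookmark);
        # embedding or framing from other tabs is simply refused
        return "isolated" if manual_switch else "refused"
    if host in REGISTERED_SITES:
        return "tracked"    # keep cookies, for convenience
    return "anonymous"      # default: unknown sites get no persistent state

print(mode_for("random-news.example"))   # -> anonymous
print(mode_for("mybank.example"))        # -> refused
print(mode_for("mybank.example", True))  # -> isolated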
On a side note, having TOR tabs would also be nice.

16 August 2011

Erich Schubert: Documenting fat-jar licenses

Dear Lazyweb,
What is the appropriate way to document the individual licenses of a fat jar (a jar archive that includes the program along with all its dependencies)?
I've been living in the happy world of Linux distributions, where one would just package the dependencies independently (or actually just specify which existing packages you use, since most are already packaged by helpful other developers). But for the non-Linux people, a fat jar seems to be important so that they can double-click it to run the application.
When building larger Java applications, you end up using a fair amount of external code, such as various Apache Commons libraries and Apache Batik. I'm currently including all their LICENSE.txt files in a directory named "legal", and I'm trying to make it as obvious as possible which license applies to which parts of the jar archive. Is there any best practice for doing this? I don't want to reinvent the wheel; I'd also like to avoid any common legal pitfalls, obviously.
Feel free to respond either using the Disqus comment function or by email via erich () debian org

26 July 2011

Erich Schubert: Restricting Skype via iptables

Whenever I launch Skype on my computer, it gets banned from the university network within a few minutes; the ban expires again a few minutes after I close Skype. This is likely due to the aggressive nature of Skype; maybe the firewalls think it is trying to do a DDoS attack. One of the known big issues of using Skype.
For Windows users, there are some known workarounds to limit Skype that usually involve registry editing. These are unfortunately not available on Linux.
Therefore, I decided to play around with advanced iptables functionality. You cannot match the originating process reliably (the owner match module seemed to include such functionality at some point, but it was deemed unreliable on multi-core systems), but there are other, more efficient methods of achieving the same: run Skype under a dedicated group and match on the group owner.
Here's my setup:
# Add a system group for Skype
addgroup --system skype
# Override permissions of skype (assuming Debian package!)
dpkg-statoverride --update --add root skype 2755 `which skype`
And these are the iptables rules I use:
iptables -I OUTPUT -p tcp -m owner --gid-owner skype \
    -m multiport ! --dports 80,443 -j REJECT
iptables -I OUTPUT -p udp -m owner --gid-owner skype -j REJECT
They allow outgoing TCP connections by Skype only on ports 80 and 443 (which supposedly do not trigger the firewall; in fact, this filter is recommended by our network administration for Skype) and reject all outgoing UDP.
Or wrapped as pyroman (my firewall configuration tool; aptitude install pyroman) module:
"""
Skype restriction to avoid firewall block.
Raw iptables commands.
"""
iptables(Firewall.output, "-p tcp -m owner --gid-owner skype -m multiport ! --dports 80,443 -j %s" % Firewall.reject)
iptables(Firewall.output, "-p udp -m owner --gid-owner skype -j %s" % Firewall.reject)
which I've put just after the conntrack default module, as 05_skype.py

3 July 2011

Erich Schubert: Google vs. Facebook

Let the games begin. It looks like Google and Facebook are going to fight it out.
Given the spectacular failures of Wave and Buzz, people are of course skeptical about the prospects of Google Plus. However, I'm rather confident it is going to stay. Here are some reasons why:
Some people think that Google will not stand a chance against the social giant Facebook. But after all, Google has more users - and it has tons of services people like to use. So when Google Maps gets Plus integration, will users use it to chat about where to meet, or will they go the long way and post the map URL on Facebook, without the option of updating it collaboratively?
Google's position is much stronger than many people believe, once you think about the integration possibilities. The current Plus is just a fragment, the missing puzzle piece connecting the other apps. But imagine that Google now Plus-connects its various services: YouTube, Maps, Mail, Talk, ... - for them, this is just a matter of some engineering. I'd guess around half of these are already in internal testing. And Facebook just can't keep up there. Sure, they do have Facebook Video. But actually, most people prefer YouTube. And while Google can integrate Plus all the way there, Facebook cannot. And while Google is the master of search, Facebook is particularly weak there - they can't even properly search their own data. Google, however, will at some point offer a "find things that interest me" button; Sparks is just the beginning: there you still have to define your interests manually (which will probably remain an option due to privacy concerns!), and it is way too static right now.
So essentially, Google doesn't need to copy Facebook. They just need to do what is obvious on their own products, and Facebook will have a hard time keeping up.
Plus, in my opinion, Google got the timing just right. Users aren't too happy with Facebook these days; there is just no big alternative around anymore - their friends are on Facebook and not on some other social network. Facebook doesn't seem to evolve much anymore. The mail functionality opened more security holes (apparently you can post to a group under a fake user name when you spoof the sender's email address) than it contributed functionality and usefulness. Privacy handling is still not in line with the laws of countries such as Germany, but Facebook essentially keeps telling those users that it doesn't care. Spam and fraud still reappear every month, following the same pattern again and again (clickjacking). The search function of Facebook is still usually described as "useless" ... People waste time in games and annoy their friends with random game invitations and posts. Facebook had better make a major move now, too - more than a demo of Skype integration. But whenever Facebook changed, their users complained ...

Of course, Google+ still has a long way to go, too. There are still many things missing - for example, groups and events. I figure Google is already testing them as "dogfood", and they'll actually come out within the month. By groups I do not mean the existing Google product, but what would be "public circles" that you need to join and that are accessible to all circle members instead of just the creator. And events are also a key function, probably one of the most used on Facebook. These may require much more careful design to integrate well with Calendar. But given the visual update of Calendar these days, this may be just around the corner, too.

4 June 2011

Mike (stew) O'Connor: My Movein Script

Erich Schubert's blog post reminded me that I've been meaning to write up a post detailing how I'm keeping parts of my $HOME in git repositories. My goal has been to keep my home directory in a version control system effectively. I have a number of constraints, however. I want the system to be modular: I don't always need X-related config files in my home directory; sometimes I want just my zsh-related files and my emacs-related files. I have multiple machines I check email from, and on those I want to keep my notmuch/offlineimap files in sync, but I don't need these on every machine I'm on, especially since those configurations contain more sensitive data. I played around with laysvn for a while, but it never really seemed comfortable. More recently I discovered that madduck had started a "vcs-home" website and mailing list, talking about doing exactly what I'm trying to do. I'm now going with madduck's idea of using git with detached work trees, so that I can have multiple git repositories all using $HOME as their $GIT_WORK_TREE. I have a script inspired by his vcsh script that creates a subshell where the GIT_DIR and GIT_WORK_TREE variables are set for me. I can do git operations related to just one git repository in that shell, while still operating directly on my config files in $HOME, and avoiding any kind of nasty symlinking or hardlinking. Since I usually use my script to quickly "move in" to a new host, I named it "movein". It can be found here. Here's how I'll typically use it:
    stew@guppy:~$ movein init
    git server hostname? git.vireo.org
    path to remote repositories? [~/git] 
    Local repository directory? [~/.movein] 
    Location of .mrconfig file? [~/.mrconfig] 
    stew@guppy:~$ 
This is run just once. It asks me questions about how to set up the "movein" environment. Now I should have a .moveinrc storing the answers I gave above, a stub of a .mrconfig, and an empty .movein directory. The next thing to do is to add some of my repositories. The one I typically add on all machines is my "shell" repository. It has a .bashrc and .zshrc, an .alias file that both of them source, and other zsh goodies I'll generally wish to have around:
    stew@guppy:~$ ls .zshrc
    ls: cannot access .zshrc: No such file or directory
    stew@guppy:~$ movein add shell
    Initialized empty Git repository in /home/stew/.movein/shell.git/
    remote: Counting objects: 42, done.
    remote: Compressing objects: 100% (39/39), done.
    remote: Total 42 (delta 18), reused 0 (delta 0)
    Unpacking objects: 100% (42/42), done.
    From ssh://git.vireo.org//home/stew/git/shell
     * [new branch]      master     -> origin/master
    stew@guppy:~$ ls .zshrc
    .zshrc
So what happened here is that the ssh://git.vireo.org/~/git/shell.git repository was cloned with GIT_WORK_TREE=~ and GIT_DIR=.movein/shell.git. My .zshrc (along with a bunch of other files) has appeared. Next perhaps I'll add my emacs config files:
    stew@guppy:~$ movein add emacs       
    Initialized empty Git repository in /home/stew/.movein/emacs.git/
    remote: Counting objects: 77, done.
    remote: Compressing objects: 100% (63/63), done.
    remote: Total 77 (delta 10), reused 0 (delta 0)
    Unpacking objects: 100% (77/77), done.
    From ssh://git.vireo.org//home/stew/git/emacs
     * [new branch]      emacs21    -> origin/emacs21
     * [new branch]      master     -> origin/master
    stew@guppy:~$ ls .emacs
    .emacs
    stew@guppy:~$ 
My remote repository has a master branch, but also an emacs21 branch, which I can use when checking out on older machines that don't yet have newer versions of emacs. Let's say I have made changes to my .zshrc file and want to check them in. Since we are working with detached work trees, git can't immediately help us:
    stew@guppy:~$ git status
    fatal: Not a git repository (or any of the parent directories): .git
The movein script allows me to "login" to one of the repositories. It will create a subshell with GIT_WORK_TREE and GIT_DIR set. In that subshell, git operations operate as one might expect:
    stew@guppy:~ $ movein login shell
    stew@guppy:~ (shell:master>*) $ echo >> .zshrc
    stew@guppy:~ (shell:master>*) $ git add .zshrc                                       
    stew@guppy:~ (shell:master>*) $ git commit -m "adding a newline to the end of .zshrc"
    [master 81b7311] adding a newline to the end of .zshrc
     1 files changed, 1 insertions(+), 0 deletions(-)
    stew@guppy:~ (shell:master>*) $ git push
    Counting objects: 8, done.
    Delta compression using up to 2 threads.
    Compressing objects: 100% (6/6), done.
    Writing objects: 100% (6/6), 546 bytes, done.
    Total 6 (delta 4), reused 0 (delta 0)
    To ssh://git.vireo.org//home/stew/git/shell.git
       d24bf2d..81b7311  master -> master
    stew@guppy:~ (shell:master*) $ exit
    stew@guppy:~ $ 
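For the curious, the core of the "login" mechanism can be sketched in a few lines of Python (an illustration of the idea only, not the actual movein script; the paths follow the defaults from "movein init" above):
    # Sketch of the "movein login" idea: spawn a subshell whose git
    # operations all target one fake-bare repository under ~/.movein,
    # while the work tree is $HOME itself.
    import os
    import subprocess

    def movein_login(repo):
        home = os.path.expanduser("~")
        env = dict(os.environ)
        env["GIT_DIR"] = os.path.join(home, ".movein", repo + ".git")
        env["GIT_WORK_TREE"] = home
        # inside this subshell, plain "git status", "git add" etc.
        # operate on the chosen repository with $HOME as the work tree
        subprocess.call([env.get("SHELL", "/bin/bash")], env=env, cwd=home)

    movein_login("shell")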
If I want to create a brand new repository from files in my home directory, I can:
    stew@guppy:~ $ touch methere
    stew@guppy:~ $ touch mealsothere
    stew@guppy:~ $ movein new oohlala methere mealsothere
    Initialized empty Git repository in /home/stew/git/oohlala.git/
    Initialized empty Git repository in /home/stew/.movein/oohlala.git/
    [master (root-commit) 7abe5ba] initial checkin
     0 files changed, 0 insertions(+), 0 deletions(-)
     create mode 100644 mealsothere
     create mode 100644 methere
    Counting objects: 3, done.
    Delta compression using up to 2 threads.
    Compressing objects: 100% (2/2), done.
    Writing objects: 100% (3/3), 224 bytes, done.
    Total 3 (delta 0), reused 0 (delta 0)
    To ssh://git.vireo.org//home/stew/git/oohlala.git
     * [new branch]      master -> master
Above, the command movein new oohlala methere mealsothere says "create a new repository containing two files: methere and mealsothere". A bare repository is created on the remote machine, a repository is created in the .movein directory, the files are committed, and the new commit is pushed to the remote repository. Now, on some other machine, I could run movein add oohlala to get these two new files. The movein script maintains a .mrconfig file, so that joeyh's mr tool can be used to manage the repositories in bulk. Commands like "mr update", "mr commit" and "mr push" will act on all the known repositories. Here's an example:
    stew@guppy:~ $ cat .mrconfig
    [DEFAULT]
    include = cat /usr/share/mr/git-fake-bare
    [/home/stew/.movein/emacs.git]
    checkout = git_fake_bare_checkout 'ssh://git.vireo.org//home/stew/git/emacs.git' 'emacs.git' '../../'
    [/home/stew/.movein/shell.git]
    checkout = git_fake_bare_checkout 'ssh://git.vireo.org//home/stew/git/shell.git' 'shell.git' '../../'
    [/home/stew/.movein/oohlala.git]
    checkout = git_fake_bare_checkout 'ssh://git.vireo.org//home/stew/git/oohlala.git' 'oohlala.git' '../../'
    stew@guppy:~ $ mr update
    mr update: /home/stew//home/stew/.movein/emacs.git
    From ssh://git.vireo.org//home/stew/git/emacs
     * branch            master     -> FETCH_HEAD
    Already up-to-date.
    mr update: /home/stew//home/stew/.movein/oohlala.git
    From ssh://git.vireo.org//home/stew/git/oohlala
     * branch            master     -> FETCH_HEAD
    Already up-to-date.
    mr update: /home/stew//home/stew/.movein/shell.git
    From ssh://git.vireo.org//home/stew/git/shell
     * branch            master     -> FETCH_HEAD
    Already up-to-date.
    mr update: finished (3 ok)
There are still issues I'd like to address. The big one in my mind is that there is no .gitignore, so when you "movein login somerepository" and then run "git status", it tells you about hundreds of untracked files in your home directory. Ideally, I just want to know about the files which are already associated with the repository I'm logged into.

2 June 2011

Erich Schubert: Managing user configuration files

Dear Lazyweb,
How do you manage your user configuration files? I have around four home directories that I frequently use. They are sufficiently in sync, but I have been considering actually using some file management to synchronize them better. I'm talking about files such as the shell config, ssh config, .vimrc etc.
I had some discussions about this before, and the consensus was that some version control system is probably best. Git seemed to be a good candidate; I remember having read about things like this a dozen years ago, when CVS was still common and Subversion was new.
So dear lazyweb, what are your experiences with managing your user configuration? What setup would you recommend?
Update: See vcs-home for various related links and at least five different ways of doing this. mr, a multi-repository VCS wrapper, seems to handle this particularly well.

27 May 2011

Erich Schubert: Dear Lazyweb, how to write multi-locale python code

Dear Lazyweb,
I've been toying around with a Python WSGI application, i.e. a multi-threaded persistent web application. Now I'd like to add multi-language support to it. I need to format datetimes in human-readable form, but I haven't yet found a sane way to do this using strftime. Essentially, strftime will use the current application locale; however, since I'm running multi-threaded, different threads might want to use different locales, so changing the locale is bound to cause race conditions.
So what is the best way to pretty-print (including weekday names!) datetime, currency and similar values in a multi-threaded, multi-locale context in Python? Gettext and manually emulating strftime don't sound that sensible to me. And of course, I don't want to have to translate the weekday names myself into every language I choose to support...
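One direction an answer could take: the Babel i18n library formats values per call, with no process-wide locale state, which sidesteps the setlocale race entirely. A minimal sketch, assuming Babel as an added dependency (it is not part of the setup described above):
# Minimal sketch using the Babel library (an assumed dependency, not part
# of the original setup). Formatting is per call, with no global locale
# state, so concurrent threads can safely use different locales.
from datetime import datetime
from babel.dates import format_datetime
from babel.numbers import format_currency

now = datetime.now()
# the "full" format includes localized weekday and month names
print(format_datetime(now, format="full", locale="de_DE"))
print(format_datetime(now, format="full", locale="en_US"))
print(format_currency(1234.5, "EUR", locale="de_DE"))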

12 May 2011

Erich Schubert: AMD64 broken on Debian unstable - avoid libc6 2.13-3

Beware of upgrading on AMD64. Make sure to avoid version 2.13-3, as it will render your system unbootable and unusable. The reason is as simple (a missing symlink) as it is severe.
There is a bug report with instructions on how to recover. If you are lucky, you have a root shell open to restore the missing symlink. Otherwise, you need to reboot with the parameters break=init rw, recover the link with cd root; ln -s lib lib64, then sync, unmount and reboot. It's not really hard to do once you know how. But it is a lot easier to avoid upgrading to this version. My i386 mirror already has the fixed upload (but i386 is not affected anyway). So by tomorrow it should be safe again (depending on your mirror's delay).

5 May 2011

Erich Schubert: Upcoming publications in data mining

Upcoming 2011 publications of my research:
Just presented at the SDM11 last weekend:
H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek
Interpreting and Unifying Outlier Scores
In Proceedings of the 11th SIAM International Conference on Data Mining (SDM), Mesa, AZ, 2011.
To be presented and published end of August:
T. Bernecker, M. E. Houle, H.-P. Kriegel, P. Kröger, M. Renz, E. Schubert, A. Zimek
Quality of Similarity Rankings in Time Series
In Proceedings of the 12th International Symposium on Spatial and Temporal Databases (SSTD), Minneapolis, MN, 2011.
E. Achtert, A. Hettab, H.-P. Kriegel, E. Schubert, A. Zimek
Spatial Outlier Detection: Data, Algorithms, Visualizations
In Proceedings of the 12th International Symposium on Spatial and Temporal Databases (SSTD), Minneapolis, MN, 2011.
The latter will also accompany the release of version 0.4 of our data mining research software ELKI.

28 April 2011

Erich Schubert: SIAM SDM11 - Unified Outlier Scores

I'm currently in Phoenix, AZ at the 2011 SIAM International Conference on Data Mining.
My contribution is titled "Interpreting and Unifying Outlier Scores": a method that allows the combination, interpretation, visualization etc. of the scores produced by existing outlier algorithms. The method brings a bit more statistics back into a data mining area that has drifted away from its statistical roots.
We apply the method to a couple of outlier detection algorithms and combine them using a naive ensemble approach that still outperforms existing outlier ensembles.

14 March 2011

Erich Schubert: GNOME3 in Debian experimental - python and dconf

As GNOME3 slowly enters Debian experimental, things become a bit ... experimental.
The file manager can be set to still draw icons on the desktop, but that doesn't entirely work yet (it will then also open folders as desktop...).
One machine had lost its keyboard settings, and I could not set the fonts I wanted...
There is a tool called dconf-editor that allows you to manually tweak some settings, such as the fonts. But it doesn't seem to support list values yet - and the keyboard layout setting is a string list.
So here's sample python code to modify such a value:
from gi.repository import Gio
s = Gio.Settings.new("org.gnome.libgnomekbd.keyboard")
s.set_strv("layouts", ["de"])
Update: you could also install the optional libglib2.0-bin package and use the gsettings command.
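To double-check the change from Python, the same API can read the value back; a minimal sketch following the snippet above:
# Read the value back to confirm the change (same schema and key as above).
from gi.repository import Gio
s = Gio.Settings.new("org.gnome.libgnomekbd.keyboard")
print(s.get_strv("layouts"))  # e.g. ['de']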
